Scalable Feature Selection for Large Sized Databases
Author
Abstract
Feature selection determines the relevant features in the data and is often applied in pattern classification. A particular constraint on feature selection today is that databases are typically very large, so an effective method is needed to meet this practical demand. A scalable probabilistic algorithm is presented here as an alternative to exhaustive and heuristic approaches; it is designed and implemented to meet the needs arising from real-world data mining applications. Through experiments, we show that (1) the probabilistic algorithm is effective in obtaining optimal or suboptimal feature subsets, and (2) its scalable version expedites feature selection further and scales up without sacrificing the quality of the selected features.
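The abstract does not reproduce the algorithm itself, but the probabilistic approach it describes can be illustrated with a minimal Las Vegas style sketch: repeatedly draw random feature subsets and keep the smallest subset whose inconsistency rate on the data stays within a tolerance. The function names, the inconsistency criterion, and all parameter values below are illustrative assumptions, not details taken from the paper.

```python
import random
from collections import Counter, defaultdict

def inconsistency_rate(data, labels, subset):
    """Fraction of rows whose class label disagrees with the majority label
    among rows sharing the same values on the selected features."""
    groups = defaultdict(list)
    for row, label in zip(data, labels):
        groups[tuple(row[i] for i in subset)].append(label)
    inconsistent = sum(len(g) - max(Counter(g).values()) for g in groups.values())
    return inconsistent / len(data)

def probabilistic_feature_selection(data, labels, n_features,
                                    max_tries=1000, threshold=0.0, seed=0):
    """Hypothetical sketch: randomly sample feature subsets and keep the
    smallest one whose inconsistency rate stays within `threshold`."""
    rng = random.Random(seed)
    best = list(range(n_features))          # start from the full feature set
    for _ in range(max_tries):
        size = rng.randint(1, len(best))    # never consider larger subsets
        subset = rng.sample(range(n_features), size)
        if size < len(best) and inconsistency_rate(data, labels, subset) <= threshold:
            best = sorted(subset)
    return best

# Toy usage: feature 0 fully determines the label; features 1 and 2 are noise.
data = [(0, 5, 1), (0, 2, 7), (1, 5, 1), (1, 9, 3)]
labels = [0, 0, 1, 1]
print(probabilistic_feature_selection(data, labels, n_features=3))  # expected: [0]
```

A scalable variant in the spirit of the abstract could evaluate candidate subsets on a sample of the rows rather than the full table; that detail is an assumption here, not the paper's specification.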
Similar Resources
Massively Parallel Unsupervised Feature Selection on Spark
High dimensional data sets pose important challenges such as the curse of dimensionality and increased computational costs. Dimensionality reduction is therefore a crucial step for most data mining applications. Feature selection techniques allow us to achieve said reduction. However, it is nowadays common to deal with huge data sets, and most existing feature selection algorithms are designed ...
Full Text
Feature selection with test cost constraint
Feature selection is an important preprocessing step in machine learning and data mining. In real-world applications, costs, including money, time and other resources, are required to acquire the features. In some cases, there is a test cost constraint due to limited resources. We shall deliberately select an informative and cheap feature subset for classification. This paper proposes the featu...
Full Text
Some Issues on Scalable Feature Selection
Feature selection determines relevant features in the data. It is often applied in pattern classification, data mining, as well as machine learning. A special concern for feature selection nowadays is that the size of a database is normally very large, both vertically and horizontally. In addition, feature sets may grow as the data collection process continues. Effective solutions are needed to a...
Full Text
An Overview of the New Feature Selection Methods in Finite Mixture of Regression Models
Variable (feature) selection has attracted much attention in contemporary statistical learning and recent scientific research. This is mainly due to the rapid advancement in modern technology that allows scientists to collect data of unprecedented size and complexity. One type of statistical problem in such applications is concerned with modeling an output variable as a function of a sma...
Full Text
High Dimensional Feature Indexing Using Hybrid Trees
Feature based similarity search is emerging as an important search paradigm in database systems. The technique used is to map the data items as points into a high dimensional feature space which is indexed using a multidimensional data structure. Similarity search then corresponds to a range search over the data structure. Traditional multidimensional data structures (e.g., R-tree, KDB-tree, gr...
Full Text